Choosing the reference level wisely

For each categorical variable in a multiple regression model, the program treats one of the
categories as the reference level and evaluates how each of the other levels affects the outcome
relative to that reference level. Statistical software lets you specify the reference level for a
categorical variable, but you can also let the software choose it for you. The problem is that the
software applies an arbitrary rule to make that choice (such as taking whichever level sorts first
alphabetically), and usually chooses one you don’t want. Therefore, it is better to tell the
software which reference level to use for every categorical variable. For specific advice on
choosing an appropriate reference level, read the next section, “Recoding categorical variables as
numerical.”
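
To make the difference between a default and a chosen reference level concrete, here is a minimal sketch in plain Python. The helper names (pick_reference, dummy_code) and the sample diagnosis values are assumptions made for illustration. They are not the interface of any particular statistical package, and real software may use a different default rule than the alphabetical one mimicked here.

```python
def pick_reference(values, reference=None):
    """Return the caller's reference level, or the alphabetically first level."""
    levels = sorted(set(values))
    if reference is None:
        return levels[0]  # mimics an alphabetical default (an assumption)
    if reference not in levels:
        raise ValueError(f"{reference!r} is not a level of this variable")
    return reference


def dummy_code(values, reference=None):
    """Build one 0/1 indicator list per level, leaving out the reference level."""
    ref = pick_reference(values, reference)
    others = [level for level in sorted(set(values)) if level != ref]
    return {level: [1 if v == level else 0 for v in values] for level in others}


# Hypothetical sample data for a primary-diagnosis variable
primary_dx = ["Hypertension", "Diabetes", "Cancer", "Other", "Diabetes"]

default_coding = dummy_code(primary_dx)                    # reference falls to "Cancer"
chosen_coding = dummy_code(primary_dx, reference="Other")  # reference set on purpose

print("Default reference:", pick_reference(primary_dx))
print("Indicators with the default:", sorted(default_coding))
print("Indicators with 'Other' as reference:", sorted(chosen_coding))
```

With the default, every coefficient would be read relative to Cancer simply because of spelling. Naming the reference explicitly makes the comparison group a deliberate decision.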

Recoding categorical variables as numerical

Data may be stored as character variables. For example, the variable for primary diagnosis
(PrimaryDx) may contain character data, such as Hypertension, Diabetes, Cancer, and Other. Because
it is difficult for statistical programs to work with character data, these variables are usually
recoded with a numerical code before being used in a regression. This means a new variable is
created and coded as 1 for hypertension, 2 for diabetes, 3 for cancer, and so on.

It is best to code binary variables as 0 for not having the attribute or state, and 1 for having the
attribute or state. So a binary variable named Cancer should be coded as Cancer = 1 if the
participant has cancer, and Cancer = 0 if they do not.
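
The 0/1 convention can be sketched in a few lines of plain Python. The raw "Yes"/"No" responses and the helper name code_binary are hypothetical, chosen only to illustrate the recoding.

```python
def code_binary(values, present="Yes"):
    """Recode raw responses as 1 (has the attribute) or 0 (does not)."""
    return [1 if v == present else 0 for v in values]


# Hypothetical raw responses for four participants
cancer_raw = ["Yes", "No", "No", "Yes"]
cancer = code_binary(cancer_raw)

print(cancer)
# With 0/1 coding, the mean of the variable is the proportion with the attribute
print(sum(cancer) / len(cancer))
```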

For categorical variables with more than two levels, it’s more complicated. Even if you recode the
categorical variable from characters to a numeric code, that code cannot be used in regression
unless you want to model the category as an ordinal variable. Imagine a variable coded as 1 =
graduated high school, 2 = graduated college, and 3 = obtained post-graduate degree. If this
variable were entered as a predictor in a regression, the model would assume equal steps going from
code 1 to code 2, and from code 2 to code 3. Anyone who has applied to college or gone to graduate
school knows these steps are not equal! To solve this problem, you could select one level as the
reference group (let’s choose 3), and then create two binary indicator variables for the other two
levels: one for 1 = graduated high school and one for 2 = graduated college. Here’s another example
of coding multilevel categorical variables as a set of indicator variables, where each level is
assigned its own binary variable that is coded 1 if the level applies to the row, and 0 if it does
not (see Table 17-1).

TABLE 17-1 Coding a Multilevel Category into a Set of Binary Indicator Variables

StudyID  PrimaryDx     HTN  Diab  Cancer  OtherDx
1        Hypertension  1    0     0       0
2        Diabetes      0    1     0       0
3        Cancer        0    0     1       0
4        Other         0    0     0       1
5        Diabetes      0    1     0       0
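
The coding in Table 17-1 can be reproduced with a short sketch in plain Python. The helper name add_indicators and the dictionary layout are assumptions made for illustration. The column names and data values are taken from the table.

```python
# Rows from Table 17-1: (StudyID, PrimaryDx)
rows = [
    (1, "Hypertension"),
    (2, "Diabetes"),
    (3, "Cancer"),
    (4, "Other"),
    (5, "Diabetes"),
]

# Indicator column name -> the PrimaryDx level it flags
columns = {
    "HTN": "Hypertension",
    "Diab": "Diabetes",
    "Cancer": "Cancer",
    "OtherDx": "Other",
}


def add_indicators(rows, columns):
    """Return one record per row with a 0/1 indicator for every level."""
    coded = []
    for study_id, dx in rows:
        record = {"StudyID": study_id, "PrimaryDx": dx}
        for name, level in columns.items():
            record[name] = 1 if dx == level else 0
        coded.append(record)
    return coded


coded = add_indicators(rows, columns)
for record in coded:
    print(record)
```

Each row has exactly one indicator set to 1. When the model is fit, the indicator for the level you chose as the reference is left out, so the remaining indicators are interpreted relative to it.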